[SPARK-20159][SPARKR][SQL] Support all catalog API in R #17483
felixcheung wants to merge 8 commits into apache:master
Conversation
#' @param tableName a name of the table.
#' @param path the path of files to load.
#' @param source the name of the external data source.
#' @param schema the schema of the data required for certain data sources.
this is added to createExternalTable (it won't work with a json source otherwise)
everything else is simply moved as-is from SQLContext.R
Test build #75393 has finished for PR 17483 at commit
cc @gatorsmile for any SQL specific inputs
@felixcheung I will take a look at this later today. Meanwhile, in the PR description can you note down the new functions being added in this change?
dataFrame(sdf)
}

createExternalTable <- function(x, ...) {
Just FYI, createExternalTable is deprecated. See the PR: #16528
Let me make the corresponding changes in SQLContext too.
yes, I'm aware. I'm not sure we need to make changes to SQLContext: schema is required for a json source, but in Scala it's fine to use createTable instead.
It makes sense to add it in R here though, because:
- there is no createTable method (hmm, reviewing the PR you reference, maybe we should add it)
- createTable just sounds too generic and too much like many existing R methods (in R, table is everywhere!), so I wasn't sure it was a good idea to add it in R
- createExternalTable since 2.0 is decoupled from SQLContext and SparkSession: it doesn't take either as a parameter, and it calls into the catalog
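To make the last point concrete, a minimal SparkR sketch (the file path and schema here are made up for illustration; assumes a session started with sparkR.session()):

```r
library(SparkR)
sparkR.session()

# createExternalTable takes no SQLContext/SparkSession argument; it resolves
# the active session internally and goes through its catalog.
# For a json source the schema must be supplied (the fix in this PR).
schema <- structType(structField("name", "string"), structField("age", "integer"))
people <- createExternalTable("people", path = "/tmp/people.json",
                              source = "json", schema = schema)
```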
I agree that createTable sounds very general, but I don't think it's used by base R or any popular R package?
createExternalTable is misleading now, because the table could be managed if users did not provide a value for path. Thus, we decided to rename it.
right, I was just concerned that with data.table, read.table, etc., table == data.frame in R, as opposed to a hive table or managed table, which could be fairly confusing.
anyway, I think I'll follow up with a PR for createTable. As of now path is optional for createExternalTable, and even though that's potentially misleading, it does work.
If you drop a managed table, both the data and the metadata are deleted; if you drop an external table, only the metadata is deleted. An external table is a way to protect data against accidental drop commands.
Thus, it is a pretty important concept. It could be either a Hive table or a Spark native one.
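To illustrate the managed vs. external distinction being discussed (a sketch; the table names and location are hypothetical):

```r
# Managed table: Spark owns both data and metadata;
# dropping it deletes the underlying files too.
sql("CREATE TABLE managed_people (name STRING, age INT)")
sql("DROP TABLE managed_people")   # metadata AND data removed

# External table: the catalog entry merely points at files elsewhere;
# dropping it removes only the metadata.
sql("CREATE EXTERNAL TABLE ext_people (name STRING, age INT) LOCATION '/tmp/people'")
sql("DROP TABLE ext_people")       # metadata removed, files preserved
```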
updated PR description!
btw, @gatorsmile it looks like
Test build #75417 has finished for PR 17483 at commit
Test build #75418 has finished for PR 17483 at commit
Test build #75420 has finished for PR 17483 at commit
Test build #75422 has finished for PR 17483 at commit
shivaram left a comment
Some comments. Change looks good otherwise
createOrReplaceTempView(df, "table1")
expect_equal(length(tableNames()), 1)
- tables <- tables()
+ tables <- listTables()
is tables() deprecated now?
right, there are some differences in the output (most notably, catalog.listTables returns a Dataset&lt;Table&gt; - but I'm converting that into a DataFrame anyway), and I thought list* would be more consistent with other methods like listColumns() and listDatabases()
> head(tables("default"))
database tableName isTemporary
1 default json FALSE
Warning message:
'tables' is deprecated.
Use 'listTables' instead.
See help("Deprecated")
> head(listTables("default"))
name database description tableType isTemporary
1 json default <NA> EXTERNAL FALSE
If you think it makes sense, we could make tables an alias of listTables - it would call slightly different code on the Scala side, and there are new columns and one differently named column in the result.
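If the alias were added, it could look roughly like this (a hypothetical sketch, not code from this PR; note the result columns would still differ from the old tables() output):

```r
# Keep tables() callable but route it through the new catalog API,
# warning callers that it is deprecated.
tables <- function(databaseName = NULL) {
  .Deprecated("listTables")
  listTables(databaseName)
}
```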
             "org.apache.spark.sql.catalyst.expressions.Abs")
expect_error(listFunctions("foo_db"),
             "Error in listFunctions : analysis error - Database 'foo_db' does not exist")
})
We don't have tests for recoverPartitions, refreshByPath, and refreshTable?
sharp eyes :) I was planning to add tests.
I tested these manually, but the steps are more involved, and since these are only thin wrappers in R, I think we should defer to the Scala tests.
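For reference, a minimal testthat sketch along those lines might be (assumes a running Spark session; the table name is made up, and expect_error(..., NA) asserts that no error is raised):

```r
test_that("recoverPartitions and refreshTable are callable", {
  # Create a partitioned table, then ask the catalog to rediscover its
  # partitions and refresh its cached metadata.
  sql("CREATE TABLE parts (id INT) PARTITIONED BY (p INT) STORED AS PARQUET")
  expect_error(recoverPartitions("parts"), NA)
  expect_error(refreshTable("parts"), NA)
  sql("DROP TABLE parts")
})
```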
first <- strsplit(msg[2], "\r?\n\tat")[[1]][1]
stop(paste0(rmsg, "analysis error - ", first), call. = FALSE)
} else
if (any(grep("org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: ", stacktrace))) {
Yes, I knew it. See the JIRA: https://issues.apache.org/jira/browse/SPARK-19952. @hvanhovell plans to remove it in 2.3.0.
…s(), test for recoverPartitions/refresh*
Test build #75447 has finished for PR 17483 at commit
@gatorsmile I added a line about recoverPartitions; I think we should also be clearer in the other language bindings?
Test build #75451 has finished for PR 17483 at commit
Test build #75452 has finished for PR 17483 at commit
merged to master. thanks for the review!
…createExternalTable

## What changes were proposed in this pull request?

Following up on apache#17483, add createTable (which is new in 2.2.0) and deprecate createExternalTable, plus a number of minor fixes.

## How was this patch tested?

manual, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes apache#17511 from felixcheung/rceatetable.
#' @note tables since 1.4.0
tables.default <- function(databaseName = NULL) {
  sparkSession <- getSparkSession()
  jdf <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", "getTables", sparkSession, databaseName)
if I am not mistaken, the method getTables() is not used any more in R. Can we remove it from r.SQLUtils (spark/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala, lines 219 to 226 in e8982ca)? cc @HyukjinKwon
### What changes were proposed in this pull request?

Remove the unused method `getTables()` from `r.SQLUtils`. The method was used before the changes in #17483, but R's `tables.default` was rewritten using `listTables()`: https://github.com/apache/spark/pull/17483/files#diff-2c01472a7bcb1d318244afcd621d726e00d36cd15dffe7e44fa96c54fce4cd9aR220-R223

### Why are the changes needed?

To improve code maintenance and remove dead code.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

By R tests.

Closes #30527 from MaxGekk/remove-getTables-in-r-SQLUtils.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
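The rewrite that made getTables() dead code can be summarized as (a sketch of the shape of the change, not the exact diff):

```r
# Before: tables.default called the JVM helper SQLUtils.getTables() directly.
# After: it is expressed via the public catalog API, so the Scala helper
# became unused and could be deleted.
tables.default <- function(databaseName = NULL) {
  listTables(databaseName)
}
```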
What changes were proposed in this pull request?

Add a set of catalog APIs in R

https://github.com/apache/spark/pull/17483/files#diff-6929e6c5e59017ff954e110df20ed7ff

How was this patch tested?

manual tests, unit tests